The World Happiness Report is a landmark survey of the state of global happiness. We use the 2019 edition of the report, obtained from Kaggle. The report has gained global recognition as governments, organizations, and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy, and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variation in happiness. The happiness scores and rankings use data from the Gallup World Poll and are based on answers to the main life evaluation question asked in the poll.

The data was taken from Kaggle (https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2019.csv).

The goal of our project is to investigate the correlation between the happiness indicators and the happiness score, and to find the best model (based on the lowest RMSE) for this data set to predict the happiness score of a country. We use Score as our response variable and all remaining variables, other than the country/region name and the overall rank, as explanatory variables; neither of those two is needed to accomplish our main goal. We have thoroughly investigated the data set and created summary statistics and visualizations of the important explanatory variables. There are no missing values in our data set. We have also created a correlation plot to check for severely correlated explanatory variables.
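Since model selection throughout this report is based on RMSE, here is a minimal base-R sketch of how that metric is computed (the numbers are made up for illustration, not taken from the happiness data):

```r
# Root mean squared error: the metric used to compare all models in this report
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example with hypothetical happiness scores and predictions
actual    <- c(5.0, 6.2, 4.8, 7.1)
predicted <- c(5.3, 5.9, 5.1, 6.8)  # every prediction is off by 0.3
rmse(actual, predicted)             # 0.3
```

A lower RMSE means the predictions are closer, on average, to the observed scores, which is why the model with the smallest cross-validated RMSE is selected.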

# packages used: tidyverse, caret, recipes, corrplot, heatmaply, GGally, ipred, rpart, ranger, earth
set.seed(05292022) # set seed for reproducible cross-validation
summary(happy) # checking for missing data
##   Overall.rank    Country.or.region      Score       GDP.per.capita  
##  Min.   :  1.00   Length:156         Min.   :2.853   Min.   :0.0000  
##  1st Qu.: 39.75   Class :character   1st Qu.:4.545   1st Qu.:0.6028  
##  Median : 78.50   Mode  :character   Median :5.380   Median :0.9600  
##  Mean   : 78.50                      Mean   :5.407   Mean   :0.9051  
##  3rd Qu.:117.25                      3rd Qu.:6.184   3rd Qu.:1.2325  
##  Max.   :156.00                      Max.   :7.769   Max.   :1.6840  
##  Social.support  Healthy.life.expectancy Freedom.to.make.life.choices
##  Min.   :0.000   Min.   :0.0000          Min.   :0.0000              
##  1st Qu.:1.056   1st Qu.:0.5477          1st Qu.:0.3080              
##  Median :1.272   Median :0.7890          Median :0.4170              
##  Mean   :1.209   Mean   :0.7252          Mean   :0.3926              
##  3rd Qu.:1.452   3rd Qu.:0.8818          3rd Qu.:0.5072              
##  Max.   :1.624   Max.   :1.1410          Max.   :0.6310              
##    Generosity     Perceptions.of.corruption
##  Min.   :0.0000   Min.   :0.0000           
##  1st Qu.:0.1087   1st Qu.:0.0470           
##  Median :0.1775   Median :0.0855           
##  Mean   :0.1848   Mean   :0.1106           
##  3rd Qu.:0.2482   3rd Qu.:0.1412           
##  Max.   :0.5660   Max.   :0.4530
# visualization plots for selected variables
ggplot(data = happy, aes(x = GDP.per.capita, y = Score)) + geom_point() + ggtitle("GDP per capita vs. Happiness Score") + geom_smooth(method = "lm")

ggplot(data = happy, aes(x = Generosity, y = Score)) + geom_point() + ggtitle("Generosity score vs. Happiness Score") + geom_smooth(method = "lm")

ggplot(data = happy, aes(x=Score)) + geom_histogram() + ggtitle("Histogram of Happiness Scores")

happy1 <- happy %>% dplyr::select(-Country.or.region, -Overall.rank)       # dropping region and overall rank variables

# correlation plot
corr_matrix <- cor(happy1)   # avoid masking base::plot with the name "plot"
corrplot(corr_matrix)

# filtering for visualizing data for some major countries
data <- happy %>% 
  filter(Country.or.region %in% c("France", "Sweden", "Italy", "Spain", "England", "Portugal", "Greece", "Peru", "Chile", "Brazil", "Argentina", "Bolivia", "Venezuela", "Australia", "New Zealand", "Fiji", "China", "India", "Thailand", "Afghanistan", "Bangladesh", "United States of America", "Canada", "Burundi", "Angola", "Kenya", "Togo")) %>%   # country names must exactly match the spellings used in the data set, or those rows are silently dropped
  arrange(Country.or.region) %>%
  mutate(Country = factor(Country.or.region, Country.or.region))

# Matrix format
mat <- data
rownames(mat) <- mat[,2]
mat <- mat %>% dplyr::select(-Country.or.region, -Overall.rank,-Country)

# Heat map for the major countries including all variables

p <- heatmaply(mat, 
        dendrogram = "none",
        xlab = "", ylab = "", 
        main = "",
        scale = "column",
        margins = c(60,100,40,20),
        grid_color = "white",
        grid_width = 0.00001,
        titleX = FALSE,
        hide_colorbar = TRUE,
        branches_lwd = 0.1,
        label_names = c("Country", "Feature:", "Value"),
        fontsize_row = 5, fontsize_col = 5,
        labCol = colnames(mat),
        labRow = rownames(mat),
        heatmap_layers = theme(axis.line=element_blank())
        )
p

In this plot we can see the values of our explanatory variables for some major countries. We can hover over this interactive heat map to see the specific value of each variable for each country.

summary(happy1) 
##      Score       GDP.per.capita   Social.support  Healthy.life.expectancy
##  Min.   :2.853   Min.   :0.0000   Min.   :0.000   Min.   :0.0000         
##  1st Qu.:4.545   1st Qu.:0.6028   1st Qu.:1.056   1st Qu.:0.5477         
##  Median :5.380   Median :0.9600   Median :1.272   Median :0.7890         
##  Mean   :5.407   Mean   :0.9051   Mean   :1.209   Mean   :0.7252         
##  3rd Qu.:6.184   3rd Qu.:1.2325   3rd Qu.:1.452   3rd Qu.:0.8818         
##  Max.   :7.769   Max.   :1.6840   Max.   :1.624   Max.   :1.1410         
##  Freedom.to.make.life.choices   Generosity     Perceptions.of.corruption
##  Min.   :0.0000               Min.   :0.0000   Min.   :0.0000           
##  1st Qu.:0.3080               1st Qu.:0.1087   1st Qu.:0.0470           
##  Median :0.4170               Median :0.1775   Median :0.0855           
##  Mean   :0.3926               Mean   :0.1848   Mean   :0.1106           
##  3rd Qu.:0.5072               3rd Qu.:0.2482   3rd Qu.:0.1412           
##  Max.   :0.6310               Max.   :0.5660   Max.   :0.4530
sum(is.na(happy1))    # check missing values
## [1] 0
str(happy1) #checking for variable types
## 'data.frame':    156 obs. of  7 variables:
##  $ Score                       : num  7.77 7.6 7.55 7.49 7.49 ...
##  $ GDP.per.capita              : num  1.34 1.38 1.49 1.38 1.4 ...
##  $ Social.support              : num  1.59 1.57 1.58 1.62 1.52 ...
##  $ Healthy.life.expectancy     : num  0.986 0.996 1.028 1.026 0.999 ...
##  $ Freedom.to.make.life.choices: num  0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ Generosity                  : num  0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ Perceptions.of.corruption   : num  0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
train_index <- createDataPartition(happy1$Score, p = 0.8, list = FALSE)   # 80-20 split for creating training and test datasets

train <- happy1[train_index, ]   # training set

test <- happy1[-train_index, ]   # test set

nearZeroVar(train, saveMetrics = TRUE)  # check which predictors are zv/nzv
##                              freqRatio percentUnique zeroVar   nzv
## Score                         2.000000      99.21875   FALSE FALSE
## GDP.per.capita                1.500000      94.53125   FALSE FALSE
## Social.support                1.500000      93.75000   FALSE FALSE
## Healthy.life.expectancy       1.000000      77.34375   FALSE FALSE
## Freedom.to.make.life.choices  1.000000      83.59375   FALSE FALSE
## Generosity                    1.333333      78.12500   FALSE FALSE
## Perceptions.of.corruption     1.333333      78.12500   FALSE FALSE
# create feature preprocessing blueprint

blueprint <- recipe(Score ~ ., data = happy1) %>%
  step_normalize(all_numeric_predictors()) # center and scale all numeric predictors

# preparing the blueprint
prepare <- prep(blueprint, training = train) 
baked_train <- bake(prepare, new_data = train)
baked_test <- bake(prepare, new_data = test)

We have created training and test data sets by splitting the original data 80-20 (80% is the training data set). Considering that we only have 156 observations, we chose the 80-20 split over a 70-30 split so that more data remains available for training. We initially investigated the data set for missing values and near-zero/zero-variance features; there are no missing values and no zv/nzv features. We also have no categorical features, so no label encoding is needed. We normalized all of our numeric predictors to obtain an appropriate blueprint. After specifying all of the required steps, we prepped the blueprint and baked our training and test data sets with it.
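The step_normalize() step centers and scales every numeric predictor using means and standard deviations learned from the training data only; bake() then applies those same training statistics to the test set. A minimal base-R sketch of that idea, using a toy vector rather than the actual data:

```r
# Toy training and test values for one hypothetical predictor
train_x <- c(0.2, 0.8, 1.1, 1.5)
test_x  <- c(0.5, 1.3)

# Learn the centering/scaling statistics from the training data only
mu  <- mean(train_x)   # 0.9
sdv <- sd(train_x)

# Apply the *training* statistics to both sets, as bake() does
train_scaled <- (train_x - mu) / sdv
test_scaled  <- (test_x - mu) / sdv

mean(train_scaled)  # ~0: the training set is centered
sd(train_scaled)    # 1: and scaled to unit variance
```

Reusing the training statistics for the test set avoids leaking test-set information into the preprocessing.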

library(GGally)
ggpairs(happy1, title = "Correlogram") # pairwise correlation plots to check possible relationships between the variables

cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5) # setting cv specifications for all of the models

# KNN regression
k_grid <- expand.grid(k = seq(1, 20, by = 1)) # specifying the tuning parameters for knn regression

# implementing cv to find optimal k value for the knn regression
knn_fit <- train(blueprint,
                  data = train, 
                  method = "knn",
                  trControl = cv_specs,
                  tuneGrid = k_grid,
                  metric = "RMSE")

knn_fit
## k-Nearest Neighbors 
## 
## 128 samples
##   6 predictor
## 
## Recipe steps: normalize 
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 102, 103, 101, 103, 103, 103, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.6716856  0.6727324  0.5383846
##    2  0.5745947  0.7367814  0.4557909
##    3  0.5355136  0.7711673  0.4268523
##    4  0.5253204  0.7788887  0.4103186
##    5  0.5207239  0.7849988  0.4067351
##    6  0.5190942  0.7890778  0.4110125
##    7  0.5137864  0.7943554  0.4043546
##    8  0.5122487  0.7979665  0.4027064
##    9  0.5176533  0.7950370  0.4046804
##   10  0.5232152  0.7928953  0.4105461
##   11  0.5316985  0.7876189  0.4168114
##   12  0.5408344  0.7815651  0.4242502
##   13  0.5484834  0.7777412  0.4306598
##   14  0.5520855  0.7761107  0.4334259
##   15  0.5576345  0.7721855  0.4361822
##   16  0.5617336  0.7698495  0.4402463
##   17  0.5611915  0.7721830  0.4381756
##   18  0.5642499  0.7722410  0.4417881
##   19  0.5676930  0.7707514  0.4445620
##   20  0.5700658  0.7708918  0.4481605
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 8.
ggplot(knn_fit)

min(knn_fit$results$RMSE) # cv RMSE for knn regression
## [1] 0.5122487
knn_fit$results
##     k      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1   1 0.6716856 0.6727324 0.5383846 0.08901254 0.08509549 0.07826967
## 2   2 0.5745947 0.7367814 0.4557909 0.08811573 0.08649872 0.07395478
## 3   3 0.5355136 0.7711673 0.4268523 0.08551245 0.06905233 0.07309664
## 4   4 0.5253204 0.7788887 0.4103186 0.08089025 0.06924853 0.06188832
## 5   5 0.5207239 0.7849988 0.4067351 0.07442011 0.06458766 0.05804139
## 6   6 0.5190942 0.7890778 0.4110125 0.07045122 0.06325507 0.05978091
## 7   7 0.5137864 0.7943554 0.4043546 0.07458523 0.06421677 0.06281384
## 8   8 0.5122487 0.7979665 0.4027064 0.07459189 0.06502967 0.06591925
## 9   9 0.5176533 0.7950370 0.4046804 0.06985044 0.06208819 0.06218902
## 10 10 0.5232152 0.7928953 0.4105461 0.07154932 0.06316004 0.05966839
## 11 11 0.5316985 0.7876189 0.4168114 0.06659227 0.06070177 0.05531366
## 12 12 0.5408344 0.7815651 0.4242502 0.06487918 0.06091166 0.05517103
## 13 13 0.5484834 0.7777412 0.4306598 0.06465006 0.05795219 0.05202640
## 14 14 0.5520855 0.7761107 0.4334259 0.06188118 0.05784074 0.04967546
## 15 15 0.5576345 0.7721855 0.4361822 0.06453926 0.05754519 0.04866091
## 16 16 0.5617336 0.7698495 0.4402463 0.06452682 0.06154079 0.04839653
## 17 17 0.5611915 0.7721830 0.4381756 0.06466809 0.05868196 0.04817699
## 18 18 0.5642499 0.7722410 0.4417881 0.06551335 0.06151041 0.04924281
## 19 19 0.5676930 0.7707514 0.4445620 0.06433729 0.06114380 0.04727939
## 20 20 0.5700658 0.7708918 0.4481605 0.06290682 0.06053865 0.04657724
# MLR model
# implementing cv to find RMSE for linear regression model
lm_fit <- train(blueprint,
                  data = train, 
                  method = "lm",
                  trControl = cv_specs,
                  metric = "RMSE")

lm_fit
## Linear Regression 
## 
## 128 samples
##   6 predictor
## 
## Recipe steps: normalize 
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 104, 102, 103, 100, 103, 103, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.5105353  0.7908418  0.4009258
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Ridge regression model
lambda_grid <- 10^seq(-3, 3, length = 100)   # grid of lambda values to search over

ridge_cv <- train(blueprint,
                  data = train,
                  method = "glmnet",   # for ridge regression
                  trControl = cv_specs,
                  tuneGrid = expand.grid(alpha = 0, lambda = lambda_grid),  # alpha = 0 implements ridge regression
                  metric = "RMSE")


# results from the CV procedure

ridge_cv$bestTune    # optimal lambda
##    alpha    lambda
## 37     0 0.1519911
min(ridge_cv$results$RMSE)   # RMSE for optimal lambda
## [1] 0.5020319
# The LASSO model
# implement CV

lasso_cv <- train(blueprint,
                  data = train,
                  method = "glmnet",   # for lasso
                  trControl = cv_specs,
                  tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid),  # alpha = 1 implements lasso
                  metric = "RMSE")


# results from the CV procedure

lasso_cv$bestTune    # optimal lambda
##    alpha      lambda
## 11     1 0.004037017
min(lasso_cv$results$RMSE)   # RMSE for optimal lambda
## [1] 0.5125943
# MARS model

param_grid <- expand.grid(degree = 1:3, nprune = seq(1, 100, length.out = 10)) # setting parameters for mars model

# implement CV
mars_cv <- train(blueprint,
                 data = train,
                 method = "earth",
                 trControl = cv_specs,
                 tuneGrid = param_grid,
                 metric = "RMSE")

mars_cv$bestTune   # optimal tuning parameters
##   nprune degree
## 2     12      1
min(mars_cv$results$RMSE)   # CV RMSE
## [1] 0.5141464
# Tree model
#implement cv
tree_cv <- train(blueprint,
                 data = train,
                 method = "rpart",
                 trControl = cv_specs,
                 tuneLength = 20,
                 metric = "RMSE")

tree_cv$bestTune   # optimal tuning parameter
##   cp
## 1  0
min(tree_cv$results$RMSE)   # CV RMSE
## [1] 0.5913842
# Bagging
bag_fit <- bagging(Score ~ ., 
                  data = baked_train,
                  nbagg = 500,   # number of trees to grow (bootstrap replications)
                  coob = TRUE,   # yes to computing OOB error estimate
                  control = rpart.control(minsplit = 2, cp = 0, xval = 0))  # details of each tree
                  
bag_fit #results for bagging 
## 
## Bagging regression trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Score ~ ., data = baked_train, nbagg = 500, 
##     coob = TRUE, control = rpart.control(minsplit = 2, cp = 0, 
##         xval = 0))
## 
## Out-of-bag estimate of root mean squared error:  0.5025
# Random Forest model
param_grid <- expand.grid(mtry = seq(1, 6, 1),    # sequence of 1 to number of predictors (6)
                          splitrule = "variance",   
                          min.node.size = 2)   # for each tree

# implement cv
rf_cv <- train(blueprint,
               data = train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid,
               metric = "RMSE")

rf_cv$bestTune$mtry   # optimal tuning parameter
## [1] 2
min(rf_cv$results$RMSE)   # CV RMSE
## [1] 0.4867644

Our response variable is numerical, so we implemented several regression models. We did not fit a GAM, and we did not include polynomial terms in the MLR: cross-validating a GAM was beyond the scope of this project, and the correlation plots showed no polynomial trend in any of the variables. We implemented CV for KNN, MLR, ridge, lasso, MARS, a single regression tree, and random forest, and also fit a bagging model with nbagg = 500. For our CV specifications we used 5-fold cross-validation repeated 5 times. We investigated the optimal parameters for each model and the RMSE at those optimal tuning parameters. The table below shows the results of our models.
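With the CV results in hand, choosing the winner amounts to a which.min() over the stored RMSEs. A small base-R sketch using the rounded values printed by the fitted models above:

```r
# Estimated RMSEs collected from the models above
# (the bagging value is an out-of-bag estimate rather than CV)
cv_rmse <- c(KNN = 0.5122, MLR = 0.5105, Ridge = 0.5020, Lasso = 0.5126,
             MARS = 0.5141, Tree = 0.5914, Bagging = 0.5025, RF = 0.4868)

# The best model is the one with the smallest estimated RMSE
best <- names(cv_rmse)[which.min(cv_rmse)]
best            # "RF"
cv_rmse[best]   # 0.4868
```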

Model                    CV RMSE       Optimal tuning parameters
KNN                      0.512         k = 8
MLR                      0.510         N/A
Ridge                    0.502         alpha = 0, lambda = 0.152
Lasso                    0.513         alpha = 1, lambda = 0.004
MARS                     0.514         degree = 1, nprune = 12
Single regression tree   0.591         cp = 0
Bagging                  0.503 (OOB)   nbagg = 500
Random forest            0.487         mtry = 2

From the table we can see that the optimal model is the random forest, with the lowest CV RMSE (0.487) at mtry = 2.

# final model using rf
rf_fit <- ranger(Score ~ .,
                 data = baked_train,
                 num.trees = 500,
                 mtry = rf_cv$bestTune$mtry, # using best tuning parameter from cv rf
                 splitrule = "variance",  
                 min.node.size = 2, 
                 importance = "impurity")

# variable importance
sort(rf_fit$variable.importance, decreasing = TRUE)
##               Social.support      Healthy.life.expectancy 
##                    45.765635                    40.436350 
##               GDP.per.capita Freedom.to.make.life.choices 
##                    37.375213                    12.895007 
##    Perceptions.of.corruption                   Generosity 
##                    10.792288                     5.964881
preds_rf <- predict(rf_fit, data = baked_test, type = "response")  # predictions on test set

sqrt(mean((preds_rf$predictions - baked_test$Score)^2))  # test set RMSE
## [1] 0.6027718

Based on our CV results we chose the random forest with mtry = 2 as our final model. We fit this model on the training data and investigated the importance of the variables; the most important variable was social support. We then obtained predictions from the model on our test data set. The test RMSE (test set error) for the final model was 0.603, somewhat higher than the CV RMSE of 0.487, which is not unusual given that the test set contains only around 30 observations.
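The impurity-based importance values printed by ranger are easier to interpret as relative shares of the total. A base-R sketch using the (rounded) values from the output above:

```r
# Impurity importance values reported by the final random forest (rounded)
imp <- c(Social.support = 45.77, Healthy.life.expectancy = 40.44,
         GDP.per.capita = 37.38, Freedom.to.make.life.choices = 12.90,
         Perceptions.of.corruption = 10.79, Generosity = 5.96)

# Convert to percentage shares of the total importance
round(100 * imp / sum(imp), 1)
```

On this scale, social support accounts for roughly 30% of the total importance, while generosity contributes under 4%.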